Reshaping data with `stack` and `unstack`



In [ ]:

    
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass

pd.options.display.max_rows = 8

Case study: air quality data of European monitoring stations (AirBase)

Going further with the time series case study test on the AirBase (The European Air quality dataBase) data: the actual data downloaded from the Airbase website did not look like a nice csv file (data/airbase_data.csv).

One of the actual downloaded raw data files of AirBase is included in the repo:



In [ ]:

    
!head -1 ./data/BETR8010000800100hour.1-1-1990.31-12-2012

Just reading the tab-delimited data:



In [ ]:

    
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')#, header=None)



In [ ]:

    
data.head()

The above data is clearly not ready to be used! Each row contains the 24 measurements for each hour of the day, and also contains a flag (0/1) indicating the quality of the data.

EXERCISE: Clean up this dataframe using more options of `read_csv`

specify that the values of -999 and -9999 should be regarded as NaN
specify are own column names (http://stackoverflow.com/questions/6356041/python-intertwining-two-lists)



In [ ]:



In [ ]:

For now, we disregard the 'flag' columns



In [ ]:

Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime-index.

Intermezzo: reshaping your data with `stack`, `unstack` and `pivot`

The docs say:

Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame (or Series in the case of an object with a single level of column labels) having a hierarchical index with a new inner-most level of row labels.



In [ ]:

    
df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], 'B':['a', 'b', 'a', 'b'], 'C':range(4)})
df

To use stack/unstack, we need the values we want to shift from rows to columns or the other way around as the index:



In [ ]:

    
df = df.set_index(['A', 'B'])
df



In [ ]:

    
result = df['C'].unstack()
result



In [ ]:

    
df = result.stack().reset_index(name='C')
df

pivot is similar to unstack, but let you specify column names:



In [ ]:

    
df.pivot(index='A', columns='B', values='C')

pivot_table is similar as pivot, but can work with duplicate indices and let you specify an aggregation function:



In [ ]:

    
df = pd.DataFrame({'A':['one', 'one', 'two', 'two', 'one', 'two'], 'B':['a', 'b', 'a', 'b', 'a', 'b'], 'C':range(6)})
df



In [ ]:

    
df.pivot_table(index='A', columns='B', values='C', aggfunc='count') #'mean'

Back to our case study

We can now use stack and some other functions to create a timeseries from the original dataframe:



In [ ]:

    
colnames = ['date'] + [item for pair in zip(["{:02d}".format(i) for i in range(24)], ['flag']*24) for item in pair]

data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012",
                   sep='\t', header=None, na_values=[-999, -9999], names=colnames)
data = data.drop('flag', axis=1)



In [ ]:

    
data.head()

EXERCISE: Reshape the dataframe to a timeseries

The end result should look like:

	BETR801
1990-01-02 09:00:00	48.0
1990-01-02 12:00:00	48.0
1990-01-02 13:00:00	50.0
1990-01-02 14:00:00	55.0
...	...
2012-12-31 20:00:00	16.5
2012-12-31 21:00:00	14.5
2012-12-31 22:00:00	16.5
2012-12-31 23:00:00	15.0

170794 rows × 1 columns

First, reshape the dataframe so that each row consists of one observation for one date + hour combination:



In [ ]:



In [ ]:



In [ ]:

Now, combine the date and hour colums into a datetime (tip: string columns can be summed to concatenate the strings):



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:

We can also use this with the movie data



In [ ]:

    
cast = pd.read_csv('data/cast.csv')
cast.head()



In [ ]:

    
titles = pd.read_csv('data/titles.csv')
titles.head()

EXERCISE: Define a year as a "Superman year" whose films feature more Superman characters than Batman. How many years in film history have been Superman years?



In [ ]:



In [ ]:

EXERCISE: Plot the number of actor roles each year and the number of actress roles each year over the history of film.



In [ ]:

EXERCISE: Plot the number of actor roles each year and the number of actress roles each year, but this time as a kind='area' plot.



In [ ]:



In [ ]:

Reshaping data with stack and unstack

Case study: air quality data of European monitoring stations (AirBase)

Intermezzo: reshaping your data with stack, unstack and pivot

Back to our case study

We can also use this with the movie data

Reshaping data with `stack` and `unstack`

Intermezzo: reshaping your data with `stack`, `unstack` and `pivot`